-Briefly describe your methods for gathering the data. -Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this. -Present a correlation matrix -Present 4 home price correlation scatterplots that you think are of interest. I’m going to look for interesting open data that you’ve integrated with the home sale observations. -Develop 1 map of your dependent variable (sale price) -Develop 3 maps of 3 of your most interesting independent variables. -Include any other maps/graphs/charts you think might be of interest.
The summary statistics of the full model are presented below.
| Dependent variable: | |
| price | |
| med_HH_Income | -1.925*** |
| (0.497) | |
| pct.over75K | 1,073,229.000*** |
| (150,718.600) | |
| pct.Information | -2,203,228.000*** |
| (360,495.200) | |
| pct.Finance | 1,594.924 |
| (262,495.100) | |
| pct.Professional | -181,448.900 |
| (152,270.900) | |
| pct.Ed_Health | -446,681.800*** |
| (135,859.400) | |
| nbrBedRoom | 8,919.376* |
| (5,277.324) | |
| nbrFullBaths | -20,167.680*** |
| (6,382.267) | |
| TotalFinishedSF | 159.369*** |
| (8.851) | |
| AcDscrEvaporative Cooler | 52,784.020 |
| (100,480.700) | |
| AcDscrNo AC | 53,806.290 |
| (96,560.950) | |
| AcDscrWhole House | 63,817.470 |
| (96,562.460) | |
| Age | 1,225.806*** |
| (226.321) | |
| schools_nn3 | -32.421*** |
| (4.985) | |
| trailheads_nn5 | -7.301 |
| (5.761) | |
| dist_FR | -14.970*** |
| (2.302) | |
| qualityCodeDscrAVERAGE + | -32,314.160* |
| (18,231.170) | |
| qualityCodeDscrAVERAGE ++ | 31,199.560* |
| (18,721.230) | |
| qualityCodeDscrEXCELLENT | 1,209,125.000*** |
| (42,347.180) | |
| qualityCodeDscrEXCELLENT + | 1,501,140.000*** |
| (94,512.870) | |
| qualityCodeDscrEXCELLENT++ | 2,053,472.000*** |
| (74,958.720) | |
| qualityCodeDscrEXCEPTIONAL 1 | 1,178,518.000*** |
| (94,737.590) | |
| qualityCodeDscrEXCEPTIONAL 2 | 1,989,543.000*** |
| (250,755.900) | |
| qualityCodeDscrFAIR | -74,775.010 |
| (45,893.580) | |
| qualityCodeDscrGOOD | 70,174.540*** |
| (13,001.650) | |
| qualityCodeDscrGOOD + | 109,881.300*** |
| (21,877.160) | |
| qualityCodeDscrGOOD ++ | 205,970.400*** |
| (19,933.570) | |
| qualityCodeDscrLOW | -124,112.600 |
| (97,682.940) | |
| qualityCodeDscrVERY GOOD | 300,628.500*** |
| (21,468.180) | |
| qualityCodeDscrVERY GOOD + | 617,758.800*** |
| (37,791.240) | |
| qualityCodeDscrVERY GOOD ++ | 681,368.300*** |
| (30,880.430) | |
| designCodeDscr2-3 Story | -27,609.520** |
| (11,101.130) | |
| designCodeDscrBi-level | 49,911.570** |
| (25,385.950) | |
| designCodeDscrMULTI STORY- TOWNHOUSE | -132,319.300*** |
| (15,917.080) | |
| designCodeDscrSplit-level | 19,440.710 |
| (16,834.630) | |
| ZipCode80025 | -364,412.500 |
| (283,330.700) | |
| ZipCode80026 | -283,686.300 |
| (216,120.600) | |
| ZipCode80027 | -260,814.200 |
| (216,935.800) | |
| ZipCode80301 | -163,691.500 |
| (217,934.200) | |
| ZipCode80302 | 156,545.000 |
| (219,405.000) | |
| ZipCode80303 | -105,060.900 |
| (218,506.400) | |
| ZipCode80304 | 143,416.500 |
| (219,907.900) | |
| ZipCode80305 | -79,906.970 |
| (219,864.600) | |
| ZipCode80403 | -127,448.300 |
| (227,619.200) | |
| ZipCode80422 | -37,943.930 |
| (264,007.600) | |
| ZipCode80455 | -215,880.600 |
| (232,920.400) | |
| ZipCode80466 | -224,580.500 |
| (217,922.700) | |
| ZipCode80471 | -382,909.300 |
| (490,554.700) | |
| ZipCode80481 | -65,293.540 |
| (225,752.000) | |
| ZipCode80501 | -372,582.700* |
| (216,143.100) | |
| ZipCode80503 | -423,489.700* |
| (216,770.900) | |
| ZipCode80504 | -385,146.600* |
| (216,204.100) | |
| ZipCode80510 | 290,204.600 |
| (235,334.200) | |
| ZipCode80516 | -406,437.100* |
| (216,585.600) | |
| ZipCode80540 | -309,398.300 |
| (222,118.100) | |
| ZipCode80544 | -393,245.800 |
| (305,797.900) | |
| Constant | 861,197.800*** |
| (245,352.800) | |
| Observations | 11,252 |
| R2 | 0.518 |
| Adjusted R2 | 0.516 |
| Residual Std. Error | 428,929.700 (df = 11195) |
| F Statistic | 214.978*** (df = 56; 11195) |
| Note: | p<0.1; p<0.05; p<0.01 |
What is the use of a correlation matrix? What correlations do we see?
numericVars <-
select_if(st_drop_geometry(boulder.sf), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables",
caption = "Figure 1.1")
What are the factors? Why did we choose them? What trends do we see?
st_drop_geometry(boulder.sf) %>%
dplyr::select(price, pct.over75K, pct.Professional, schools_nn3, trailheads_nn5) %>%
filter(price <= 4000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables",
caption = "Figure 1.2") +
plotTheme()
# Price per square foot
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = boulder.sf, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder.sf,"price"),
name="Quintile\nBreaks") +
labs(title="Home Sale Prices, Boulder County",
caption = "Figure 1.3") +
mapTheme()
# Distance from the Front Range
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = FrontRange, colour = "#2d6a4f", size = 2) +
geom_sf(data = boulder_homes_observed, aes(colour = q5(dist_FR)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder_homes,"dist_FR"),
name="Quintile\nBreaks") +
labs(title="Home distance from Front Range",
caption = "Figure 1.4") +
mapTheme()
# Median Household Income in Boulder County
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = acsTractsBoulder.2019.sf, aes(fill = med_HH_Income)) +
scale_fill_viridis_b() +
labs(title = "Median Household Income in Boulder County",
subtitle = "by Census Tract",
caption = "Figure 1.5") +
mapTheme()
# Distance from the Front Range
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = boulder_schools, colour = "#2d6a4f") +
labs(title = "Schools in Boulder County",
caption = "Figure 1.6") +
mapTheme()
# Sale Price + zipcodes
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = acsTractsBoulder.2019.sf, fill = NA, colour = "#55286F") +
geom_sf(data = boulder_homes_observed, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder_homes,"price"),
name="Quintile\nBreaks") +
labs(title="Home Sale Prices + Zip Code Areas",
caption = "Figure 1.7") +
mapTheme()
Briefly describe your method (remember who your audience is).
The following summary table presents the results of the linear regression on the training data set.
| Dependent variable: | |
| price | |
| med_HH_Income | -2.076*** |
| (0.605) | |
| pct.over75K | 1,136,749.000*** |
| (183,606.500) | |
| pct.Information | -2,517,058.000*** |
| (435,785.800) | |
| pct.Finance | 66,818.700 |
| (317,737.900) | |
| pct.Professional | -156,801.300 |
| (186,024.500) | |
| pct.Ed_Health | -471,132.200*** |
| (164,677.900) | |
| nbrBedRoom | 8,849.475 |
| (6,480.865) | |
| nbrFullBaths | -19,258.620** |
| (7,773.679) | |
| TotalFinishedSF | 160.101*** |
| (10.639) | |
| AcDscrEvaporative Cooler | 63,043.200 |
| (108,443.400) | |
| AcDscrNo AC | 63,860.970 |
| (103,894.300) | |
| AcDscrWhole House | 60,861.390 |
| (103,894.800) | |
| Age | 1,109.136*** |
| (273.193) | |
| schools_nn3 | -28.477*** |
| (5.975) | |
| trailheads_nn5 | -7.063 |
| (7.010) | |
| dist_FR | -14.736*** |
| (2.792) | |
| qualityCodeDscrAVERAGE + | -34,740.310 |
| (21,886.280) | |
| qualityCodeDscrAVERAGE ++ | 21,304.040 |
| (22,337.710) | |
| qualityCodeDscrEXCELLENT | 1,142,633.000*** |
| (50,293.660) | |
| qualityCodeDscrEXCELLENT + | 1,317,237.000*** |
| (109,566.100) | |
| qualityCodeDscrEXCELLENT++ | 1,945,698.000*** |
| (85,852.900) | |
| qualityCodeDscrEXCEPTIONAL 1 | 1,150,923.000*** |
| (107,404.700) | |
| qualityCodeDscrEXCEPTIONAL 2 | 1,967,271.000*** |
| (270,178.200) | |
| qualityCodeDscrFAIR | -78,528.760 |
| (53,468.100) | |
| qualityCodeDscrGOOD | 62,221.240*** |
| (15,838.690) | |
| qualityCodeDscrGOOD + | 102,477.700*** |
| (26,319.790) | |
| qualityCodeDscrGOOD ++ | 210,331.600*** |
| (24,042.850) | |
| qualityCodeDscrLOW | -132,648.500 |
| (110,725.600) | |
| qualityCodeDscrVERY GOOD | 281,883.400*** |
| (26,001.690) | |
| qualityCodeDscrVERY GOOD + | 610,056.000*** |
| (44,912.940) | |
| qualityCodeDscrVERY GOOD ++ | 669,955.300*** |
| (36,930.920) | |
| designCodeDscr2-3 Story | -26,532.340** |
| (13,465.430) | |
| designCodeDscrBi-level | 43,376.240 |
| (29,920.470) | |
| designCodeDscrMULTI STORY- TOWNHOUSE | -130,140.200*** |
| (19,321.290) | |
| designCodeDscrSplit-level | 14,541.880 |
| (20,117.970) | |
| ZipCode80025 | -351,139.700 |
| (306,081.300) | |
| ZipCode80026 | -270,869.000 |
| (232,581.300) | |
| ZipCode80027 | -237,625.900 |
| (233,677.500) | |
| ZipCode80301 | -129,292.700 |
| (234,967.900) | |
| ZipCode80302 | 149,099.200 |
| (237,038.500) | |
| ZipCode80303 | -76,438.240 |
| (235,781.900) | |
| ZipCode80304 | 160,514.300 |
| (237,727.400) | |
| ZipCode80305 | -54,384.380 |
| (237,682.100) | |
| ZipCode80403 | -175,517.900 |
| (247,367.600) | |
| ZipCode80422 | -55,758.940 |
| (290,588.400) | |
| ZipCode80455 | -225,056.900 |
| (252,463.100) | |
| ZipCode80466 | -233,085.200 |
| (234,951.300) | |
| ZipCode80471 | -398,817.400 |
| (528,420.400) | |
| ZipCode80481 | -90,260.160 |
| (244,624.900) | |
| ZipCode80501 | -359,483.600 |
| (232,634.800) | |
| ZipCode80503 | -413,351.300* |
| (233,471.700) | |
| ZipCode80504 | -372,423.700 |
| (232,691.700) | |
| ZipCode80510 | 231,214.800 |
| (257,227.700) | |
| ZipCode80516 | -399,772.700* |
| (233,209.900) | |
| ZipCode80540 | -310,717.100 |
| (240,450.600) | |
| ZipCode80544 | -402,348.900 |
| (329,051.900) | |
| Constant | 851,203.400*** |
| (266,642.400) | |
| Observations | 8,793 |
| R2 | 0.485 |
| Adjusted R2 | 0.482 |
| Residual Std. Error | 460,683.200 (df = 8736) |
| F Statistic | 147.074*** (df = 56; 8736) |
| Note: | p<0.1; p<0.05; p<0.01 |
A summary of the mean absolute error and mean average percent error (MAPE) for the price prediction on the test data set is shown below.
Graphic representations of the results of the test set prediction are shown below.
# histogram of absolute errors
ggplot(boulder.test, aes(x = price.abserror)) +
geom_histogram(binwidth=10000, fill = "green", colour = "white") +
scale_x_continuous(limits = c(0, 1000000)) +
labs(title = "Distribution of prediction errors for single test",
x = "Sale Price Absolute Error", y = "Count") +
plotTheme()
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(boulder.sf),
method = "lm", trControl = fitControl, na.action = na.pass)
K-fold cross validation with 100 folds is used to explore the generalizability of this model. A histogram of the mean average error across the 100 folds is shown below.
# histogram of cross validation MAE
mae <- data.frame(reg.cv$resample[,3]) %>%
rename(mae = reg.cv.resample...3.)
ggplot(mae, aes(x = mae)) +
geom_histogram(binwidth=10000, fill = "orange", colour = "white") +
scale_x_continuous(labels = c(0, 100000, 200000, 300000, 400000, 500000),
limits = c(0, 500000)) +
labs(title = "Distribution of MAE",
subtitle = "k-fold cross validation; k = 100",
x = "Mean Absolute Error", y = "Count") +
plotTheme()
The prices predicted for the test set are plotted against the actual sale prices for the test set in the figure below.
ggplot(boulder.test) +
geom_point(aes(price, price.predict)) +
geom_smooth(aes(price, price), colour = "orange") +
geom_smooth(method = "lm", aes(price, price.predict), se = FALSE, colour = "green") +
labs(title = "Predicted sale price as a function of observed price",
subtitle = "Orange line represents a perfect prediction; Green line represents prediction",
x = "Observed Sale Price", y = "Predicted Sale Price") +
plotTheme()
Map of residual absolute errors for the test set
ggplot() +
geom_sf(data = boulder_boundary, fill = "grey") +
geom_sf(data = boulder.test, aes(colour = q5(price.abserror)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder.test,"price.abserror"),
name="Quintile\nBreaks") +
labs(title="Test set absolute price errors",
caption = "Figure X.X") +
mapTheme()
Plot of the spatial lag in errors for the test set
coords.test <- st_coordinates(boulder.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
boulder.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.error)) %>%
ggplot(aes(lagPriceError, price.error)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
labs(title = "Error as a function of the spatial lag of price errors") +
plotTheme()
Plot of the Moran’s I test on the test set
moranTest <- moran.mc(boulder.test$price.error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
allPredictions <- boulder_homes %>%
mutate(predictions = predict(reg1, boulder_homes)) %>%
dplyr::select(predictions)
ggplot() +
geom_sf(data = boulder_boundary, fill = "grey") +
geom_sf(data = allPredictions, aes(colour = q5(predictions)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(allPredictions,"predictions"),
name="Quintile\nBreaks") +
labs(title="Predictions for all homes in the dataset, Boulder County",
caption = "Figure X.X") +
mapTheme()
There is something so strange going on in this one zip code.
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(mean.MAPE = mean(price.ape, na.rm = T)) %>%
ungroup() %>%
left_join(boulder_zips) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = boulder.test, colour = "black", size = .5) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by Zip Code") +
mapTheme()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "ZipCode"
We need the scatterplot of MAPE by zip of mean price by zip
testError_by_zips <-
left_join(
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(meanPrice = mean(price, na.rm = T)),
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(MAPE = mean(price.ape)))
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "ZipCode"
testError_by_zips %>%
kable() %>% kable_styling()
| ZipCode | meanPrice | MAPE |
|---|---|---|
| 80026 | 628003.2 | 0.1814450 |
| 80027 | 702906.2 | 0.1624439 |
| 80301 | 841909.6 | 0.2166523 |
| 80302 | 1330729.5 | 0.3098138 |
| 80303 | 895975.3 | 0.1687126 |
| 80304 | 1422538.7 | 0.2876116 |
| 80305 | 883377.9 | 0.1771646 |
| 80403 | 436466.7 | 0.2104327 |
| 80422 | 785000.0 | 0.0348504 |
| 80455 | 497500.0 | 0.0499648 |
| 80466 | 552641.7 | 0.2866676 |
| 80481 | 499000.0 | 0.3135704 |
| 80501 | 425959.9 | 0.2012125 |
| 80503 | 698298.5 | 0.1868987 |
| 80504 | 510186.7 | 0.1960928 |
| 80510 | 294200.0 | 0.3622939 |
| 80516 | 617402.3 | 0.2375731 |
| 80540 | 491155.0 | -0.3172851 |
ggplot(testError_by_zips) +
geom_point(aes(meanPrice, MAPE)) +
geom_smooth(method = "lm", aes(meanPrice, MAPE), se = FALSE, colour = "green") +
labs(title = "MAPE by Zip Code as a function of mean price by Zip Code",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
Is the model generalizable?
boulder_tracts19 <-
get_acs(geography = "tract", year = 2019,
variables = c("B01001_001E","B01001A_001E","B06011_001"),
geometry = TRUE, state = "CO", county = "Boulder", output = "wide") %>%
st_transform('ESRI:102254') %>%
rename(TotalPop = B01001_001E,
NumberWhites = B01001A_001E,
Median_Income = B06011_001E) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(boulder_tracts19),
aes(fill = raceContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(boulder_tracts19),
aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"),
name="Income Context") +
labs(title = "Income Context") +
mapTheme() +
theme(legend.position="bottom"))
Is this an effective model? What were some of the more interesting variables? How much of the variation in prices could you predict? Describe the more important features? Describe the error in your predictions? According to your maps, could you account the spatial variation in prices? Where did the model predict particularly well? Poorly? Why do you think this might be?
Would you recommend your model to Zillow? Why or why not? How might you improve this model?